-
Notifications
You must be signed in to change notification settings - Fork 315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve multicore SDR #1501
Improve multicore SDR #1501
Conversation
2d0a045
to
c01dc6a
Compare
It seems that rust-fil-proofs/storage-proofs-porep/src/stacked/vanilla/create_label/multi.rs Lines 173 to 176 in 9cc6444
Otherwise, the current fill & get implemention should not be thread safe too? 🤔
|
29ee4f7
to
017af1a
Compare
The original code was
instead of
which only sets the bit, without clearing. Which resulted in the bug you discovered. |
The last change speeds up replication from 12min to 11min on my machine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dignifiedquire and I went over this, and I believe it now clears the bitmask correctly (and in correct place). This is consistent with the observation that original code (no clearing) was slowest, but first attempted fix was much faster but failed tests. Given that we also have an understanding of the original transcription error that led to the original code, I'm pretty confident in this iteration.
Nevertheless, especially since we've been bitten once, we should make sure this code is tested at least as much as the first attempted fix was. Remember, the discovered test failures were non-deterministic and not expected to always manifest. We need to test this enough times to have reasonable expectation that any failures introduced will have showed up. If we see any such failures, we need to take them seriously (as we did with the previous).
Even though I'm more confident in this than I was in the previous, I'm blocking merging here until I hear that we have successfully tested this (as thoroughly as we eventually did the previous). The reason I am requesting changes rather than approving with 'conditions' is that the latter approach didn't prevent the code from being merged last time. So this is a slight process adjustment based on what we learned in the last round: we need a bit of extra diligence here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ran the small lifecycle tests looped overnight without issue. This is where the failure showed up both on CI and on my machine in the past.
97e3886
to
4af8e48
Compare
Rebased into clear commits, the different changes. |
Did you want me to re-test here, or are we good to go? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on @cryptonemo's testing — which he believes is sufficient to have caught the previous error but uncovered no problems here — I am approving this now.
This reverts commit ff7daa9.
Based on #1477 this updates the parent selection logic in multicore sdr.
It does not implement the bitmask update, as that is not thread safe, and only the thread that fills the buffer is allowed to set this value. This I missed in my earlier review.